{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "96282d95-63c9-4d09-bf43-6dccd5076046",
   "metadata": {},
   "source": [
    "[![Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/pixeltable/pixeltable/blob/master/docs/release/howto/udfs-in-pixeltable.ipynb)&nbsp;&nbsp;\n",
    "[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pixeltable/pixeltable/blob/master/docs/release/howto/udfs-in-pixeltable.ipynb)\n",
    "\n",
    "# UDFs in Pixeltable\n",
    "\n",
    "Pixeltable comes with a library of built-in functions and integrations, but sooner or later, you'll want to introduce some customized logic into your workflow. This is where Pixeltable's rich UDF (User-Defined Function) capability comes in. Pixeltable UDFs let you write code in Python, then directly insert your custom logic into Pixeltable expressions and computed columns. In this how-to guide, we'll show how to define UDFs, extend their capabilities, and use them in computed columns.\n",
    "\n",
    "To start, we'll install the necessary dependencies, create a Pixeltable directory and table to experiment with, and add some sample data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "d35cd5f2-365b-43c6-94b2-0f4e308ca2a6",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-05-25T22:10:42.859460Z",
     "iopub.status.busy": "2024-05-25T22:10:42.858991Z",
     "iopub.status.idle": "2024-05-25T22:10:44.777272Z",
     "shell.execute_reply": "2024-05-25T22:10:44.776720Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Note: you may need to restart the kernel to use updated packages.\n"
     ]
    }
   ],
   "source": [
    "%pip install -q pixeltable"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "638d95c1-7c2f-4f38-a4e7-e7eaf830c881",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Connected to Pixeltable database at: postgresql://postgres:@/pixeltable?host=/Users/asiegel/.pixeltable/pgdata\n",
      "Created directory `udf_demo`.\n",
      "Created table `strings`.\n",
      "Computing cells:   0%|                                                    | 0/2 [00:00<?, ? cells/s]\n",
      "Inserting rows into `strings`: 2 rows [00:00, 1338.54 rows/s]\n",
      "Computing cells: 100%|███████████████████████████████████████████| 2/2 [00:00<00:00, 593.46 cells/s]\n",
      "Inserted 2 rows with 0 errors.\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>input</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>Hello, world!</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>You can do a lot with Pixeltable UDFs.</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>"
      ],
      "text/plain": [
       "                                    input\n",
       "0                           Hello, world!\n",
       "1  You can do a lot with Pixeltable UDFs."
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Inserted 2 rows with 0 errors.\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>input</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>Hello, world!</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>You can do a lot with Pixeltable UDFs.</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>"
      ],
      "text/plain": [
       "                                    input\n",
       "0                           Hello, world!\n",
       "1  You can do a lot with Pixeltable UDFs."
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pixeltable as pxt\n",
    "\n",
    "# Create the directory and table\n",
    "pxt.drop_dir('udf_demo', force=True)  # Ensure a clean slate for the demo\n",
    "pxt.create_dir('udf_demo')\n",
    "t = pxt.create_table('udf_demo.strings', {'input': pxt.StringType()})\n",
    "\n",
    "# Add some sample data\n",
    "t.insert([{'input': 'Hello, world!'}, {'input': 'You can do a lot with Pixeltable UDFs.'}])\n",
    "t.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6a6310aa-ade3-4c0e-8e69-1e9b6f7ebf6d",
   "metadata": {},
   "source": [
    "## What is a UDF?\n",
    "\n",
    "A Pixeltable UDF is just a Python function that is marked with the `@pxt.udf` decorator.\n",
    "\n",
    "```python\n",
    "@pxt.udf\n",
    "def add_one(n: int) -> int:\n",
    "    return n + 1\n",
    "```\n",
    "\n",
    "It's as simple as that! Without the decorator, `add_one` would be an ordinary Python function that operates on integers. Adding `@pxt.udf` converts it into a Pixeltable function that operates on _columns_ of integers. The decorated function can then be used directly to define computed columns; Pixeltable will orchestrate its execution across all the input data."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "54640b5d-5192-41ed-be80-82ffd1e140e0",
   "metadata": {},
   "source": [
    "For our first working example, let's do something slightly more interesting: write a function to extract the longest word from a sentence. (If there are ties for the longest word, we choose the first word among those ties.) In Python, that might look something like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "d3ca2d38-4529-4fba-ad67-eb230a4d92d5",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def longest_word(sentence: str, strip_punctuation: bool = False) -> str:\n",
    "    words = sentence.split()\n",
    "    if strip_punctuation:  # Remove non-alphanumeric characters from each word\n",
    "        words = [''.join(filter(str.isalnum, word)) for word in words]\n",
    "    i = np.argmax([len(word) for word in words])\n",
    "    return words[i]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "55629458-ca5f-437d-81bf-5e040e0f886f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'check'"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "longest_word(\"Let's check that it works.\", strip_punctuation=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "099e7b23-170c-4e62-81ce-61f1ac841df2",
   "metadata": {},
   "source": [
    "The `longest_word` Python function isn't a Pixeltable UDF (yet); it operates on individual strings, not columns of strings. Adding the decorator turns it into a UDF:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "0e8bfcf9-9fef-44a6-83cc-b8232198ae18",
   "metadata": {},
   "outputs": [],
   "source": [
    "@pxt.udf\n",
    "def longest_word(sentence: str, strip_punctuation: bool = False) -> str:\n",
    "    words = sentence.split()\n",
    "    if strip_punctuation:  # Remove non-alphanumeric characters from each word\n",
    "        words = [''.join(filter(str.isalnum, word)) for word in words]\n",
    "    i = np.argmax([len(word) for word in words])\n",
    "    return words[i]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "936c0bb6-9ca6-4b66-873c-0c7cd6664856",
   "metadata": {},
   "source": [
    "Now we can use it to create a computed column. Pixeltable orchestrates the computation like it does with any other function, applying the UDF in turn to each existing row of the table, then updating incrementally each time a new row is added."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "32dd5d6f-eb8a-4db9-b224-238a913ff0c5",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Computing cells: 100%|███████████████████████████████████████████| 2/2 [00:00<00:00, 370.78 cells/s]\n",
      "Computing cells: 100%|███████████████████████████████████████████| 2/2 [00:00<00:00, 538.11 cells/s]\n",
      "Added 2 column values with 0 errors.\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>input</th>\n",
       "      <th>longest_word</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>Hello, world!</td>\n",
       "      <td>Hello,</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>You can do a lot with Pixeltable UDFs.</td>\n",
       "      <td>Pixeltable</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>"
      ],
      "text/plain": [
       "                                    input longest_word\n",
       "0                           Hello, world!       Hello,\n",
       "1  You can do a lot with Pixeltable UDFs.   Pixeltable"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "t['longest_word'] = longest_word(t.input)\n",
    "t.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "61d81cc5-37ee-4d19-8f68-af9f6f8b7fad",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Computing cells:   0%|                                                    | 0/3 [00:00<?, ? cells/s]\n",
      "Inserting rows into `strings`: 1 rows [00:00, 339.89 rows/s]\n",
      "Computing cells: 100%|███████████████████████████████████████████| 3/3 [00:00<00:00, 420.57 cells/s]\n",
      "Inserted 1 row with 0 errors.\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>input</th>\n",
       "      <th>longest_word</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>Hello, world!</td>\n",
       "      <td>Hello,</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>You can do a lot with Pixeltable UDFs.</td>\n",
       "      <td>Pixeltable</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>Pixeltable updates tables incrementally.</td>\n",
       "      <td>incrementally.</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>"
      ],
      "text/plain": [
       "                                      input    longest_word\n",
       "0                             Hello, world!          Hello,\n",
       "1    You can do a lot with Pixeltable UDFs.      Pixeltable\n",
       "2  Pixeltable updates tables incrementally.  incrementally."
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\r",
      "Inserting rows into `strings`: 1 rows [00:00, 1766.77 rows/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\r",
      "Computing cells: 100%|███████████████████████████████████████████| 1/1 [00:00<00:00, 648.97 cells/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Inserted 1 row with 0 errors.\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>input</th>\n",
       "      <th>longest_word</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>Hello, world!</td>\n",
       "      <td>Hello,</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>You can do a lot with Pixeltable UDFs.</td>\n",
       "      <td>Pixeltable</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>Pixeltable updates tables incrementally.</td>\n",
       "      <td>incrementally.</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>"
      ],
      "text/plain": [
       "                                      input    longest_word\n",
       "0                             Hello, world!          Hello,\n",
       "1    You can do a lot with Pixeltable UDFs.      Pixeltable\n",
       "2  Pixeltable updates tables incrementally.  incrementally."
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "t.insert(input='Pixeltable updates tables incrementally.')\n",
    "t.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7dc596c5-bcea-4df8-94c1-2aea6e0fbfdd",
   "metadata": {},
   "source": [
    "Oops, those trailing punctuation marks are kind of annoying. Let's add another column, this time using the handy `strip_punctuation` parameter from our UDF. (We could alternatively drop the first column before adding the new one, but for purposes of this tutorial it's convenient to see how Pixeltable executes both variants side-by-side.) Note how _columns_ such as `t.input` and _constants_ such as `True` can be freely intermixed as arguments to the UDF."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "847d6416-b430-4fe9-b951-1c2147174c1e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Computing cells: 100%|███████████████████████████████████████████| 3/3 [00:00<00:00, 404.05 cells/s]\n",
      "Computing cells: 100%|███████████████████████████████████████████| 3/3 [00:00<00:00, 705.36 cells/s]\n",
      "Added 3 column values with 0 errors.\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>input</th>\n",
       "      <th>longest_word</th>\n",
       "      <th>longest_word_2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>Hello, world!</td>\n",
       "      <td>Hello,</td>\n",
       "      <td>Hello</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>You can do a lot with Pixeltable UDFs.</td>\n",
       "      <td>Pixeltable</td>\n",
       "      <td>Pixeltable</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>Pixeltable updates tables incrementally.</td>\n",
       "      <td>incrementally.</td>\n",
       "      <td>incrementally</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>"
      ],
      "text/plain": [
       "                                      input    longest_word longest_word_2\n",
       "0                             Hello, world!          Hello,          Hello\n",
       "1    You can do a lot with Pixeltable UDFs.      Pixeltable     Pixeltable\n",
       "2  Pixeltable updates tables incrementally.  incrementally.  incrementally"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "t['longest_word_2'] = longest_word(t.input, strip_punctuation=True)\n",
    "t.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b5811fb4-a9cd-46b3-bcdf-082880bebf65",
   "metadata": {},
   "source": [
    "## Types in UDFs\n",
    "\n",
    "You might have noticed that the `longest_word` UDF has _type hints_ in its signature.\n",
    "\n",
    "```python\n",
    "def longest_word(sentence: str, strip_punctuation: bool = False) -> str: ...\n",
    "```\n",
    "\n",
    "The `sentence` parameter, `strip_punctuation` parameter, and return value all have explicit types (`str`, `bool`, and `str` respectively). In general Python code, type hints are usually optional. But Pixeltable is a database system: _everything_ in Pixeltable must have a type. And since Pixeltable is also an orchestrator - meaning it sets up workflows and computed columns _before_ executing them - these types need to be known in advance. That's the reasoning behind a fundamental principle of Pixeltable UDFs:\n",
    "- Type hints are _required_.\n",
    "\n",
    "You can turn almost any Python function into a Pixeltable UDF, provided that it has type hints, and provided that Pixeltable supports the types that it uses. The most familiar types that you'll use in UDFs are:\n",
    "- `int`\n",
    "- `float`\n",
    "- `str`\n",
    "- `list` (can optionally be parameterized, e.g., `list[str]`)\n",
    "- `dict` (can optionally be parameterized, e.g., `dict[str, int]`)\n",
    "- `PIL.Image.Image`\n",
    "\n",
    "In addition to these standard Python types, Pixeltable also recognizes various kinds of arrays, audio and video media, and documents. For a full discussion of Pixeltable types, see the [Pixeltable Type System](TODO) howto guide."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2527b31d-ce08-4a09-adee-3c0f3a6621e5",
   "metadata": {},
   "source": [
    "## Local and Module UDFs\n",
    "\n",
    "The `longest_word` UDF that we defined above is a _local_ UDF: it was defined directly in our notebook, rather than in a module that we imported. Many other UDFs, including all of Pixeltable's built-in functions, are defined in modules. We encountered a few of these in the Pixeltable Basics tutorial: the `huggingface.detr_for_object_detection` and `openai.vision` functions. (Although these are built-in functions, they behave the same way as UDFs, and in fact they're defined the same way under the covers.)\n",
    "\n",
    "There is an important difference between the two. When you add a module UDF such as `openai.vision` to a table, Pixeltable stores a _reference_ to the corresponding Python function in the module. If you later restart your Python runtime and reload Pixeltable, then Pixeltable will re-import the module UDF when it loads the computed column. This means that any code changes made to the UDF will be picked up at that time, and the new version of the UDF will be used in any future execution.\n",
    "\n",
    "Conversely, when you add a local UDF to a table, the _entire code_ for the UDF is serialized and stored in the table. This ensures that if you restart your notebook kernel (say), or even delete the notebook entirely, the UDF will continue to function. However, it also means that if you modify the UDF code, the updated logic will not be reflected in any existing Pixeltable columns.\n",
    "\n",
    "To see how this works in practice, let's modify our `longest_word` UDF so that if `strip_punctuation` is `True`, then we remove only a single punctuation mark from the _end_ of each word."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "fdb64049-12c3-4cd6-a6e0-a798aa46670a",
   "metadata": {},
   "outputs": [],
   "source": [
    "@pxt.udf\n",
    "def longest_word(sentence: str, strip_punctuation: bool = False) -> str:\n",
    "    words = sentence.split()\n",
    "    if strip_punctuation:\n",
    "        words = [\n",
    "            word if word[-1].isalnum() else word[:-1]\n",
    "            for word in words\n",
    "        ]\n",
    "    i = np.argmax([len(word) for word in words])\n",
    "    return words[i]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3d67ba18-7ed7-4add-9980-831a7a889253",
   "metadata": {},
   "source": [
    "Now we see that Pixeltable continues to use the _old_ definition, even as new rows are added to the table."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "52a8cb29-9d81-4fd1-b3ba-8eb86370821a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Computing cells:   0%|                                                    | 0/5 [00:00<?, ? cells/s]\n",
      "Inserting rows into `strings`: 1 rows [00:00, 301.10 rows/s]\n",
      "Computing cells: 100%|███████████████████████████████████████████| 5/5 [00:00<00:00, 699.03 cells/s]\n",
      "Inserted 1 row with 0 errors.\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>input</th>\n",
       "      <th>longest_word</th>\n",
       "      <th>longest_word_2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>Hello, world!</td>\n",
       "      <td>Hello,</td>\n",
       "      <td>Hello</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>You can do a lot with Pixeltable UDFs.</td>\n",
       "      <td>Pixeltable</td>\n",
       "      <td>Pixeltable</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>Pixeltable updates tables incrementally.</td>\n",
       "      <td>incrementally.</td>\n",
       "      <td>incrementally</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>Let&#x27;s check that it still works.</td>\n",
       "      <td>works.</td>\n",
       "      <td>check</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>"
      ],
      "text/plain": [
       "                                      input    longest_word longest_word_2\n",
       "0                             Hello, world!          Hello,          Hello\n",
       "1    You can do a lot with Pixeltable UDFs.      Pixeltable     Pixeltable\n",
       "2  Pixeltable updates tables incrementally.  incrementally.  incrementally\n",
       "3          Let's check that it still works.          works.          check"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\r",
      "Inserting rows into `strings`: 1 rows [00:00, 1552.87 rows/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\r",
      "Computing cells: 100%|██████████████████████████████████████████| 2/2 [00:00<00:00, 1254.65 cells/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Inserted 1 row with 0 errors.\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>input</th>\n",
       "      <th>longest_word</th>\n",
       "      <th>longest_word_2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>Hello, world!</td>\n",
       "      <td>Hello,</td>\n",
       "      <td>Hello</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>You can do a lot with Pixeltable UDFs.</td>\n",
       "      <td>Pixeltable</td>\n",
       "      <td>Pixeltable</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>Pixeltable updates tables incrementally.</td>\n",
       "      <td>incrementally.</td>\n",
       "      <td>incrementally</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>Let's check that it still works.</td>\n",
       "      <td>works.</td>\n",
       "      <td>check</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>"
      ],
      "text/plain": [
       "                                      input    longest_word longest_word_2\n",
       "0                             Hello, world!          Hello,          Hello\n",
       "1    You can do a lot with Pixeltable UDFs.      Pixeltable     Pixeltable\n",
       "2  Pixeltable updates tables incrementally.  incrementally.  incrementally\n",
       "3          Let's check that it still works.          works.          check"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "t.insert(input=\"Let's check that it still works.\")\n",
    "t.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b64c40f1-b7f0-416c-ac15-57f624d7dabb",
   "metadata": {},
   "source": [
    "But if we add a new _column_ that references the `longest_word` UDF, Pixeltable will use the updated version."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "b62c1601-ca63-475e-ab2b-76edaac37ce0",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Computing cells: 100%|███████████████████████████████████████████| 4/4 [00:00<00:00, 568.18 cells/s]\n",
      "Computing cells: 100%|███████████████████████████████████████████| 4/4 [00:00<00:00, 570.60 cells/s]\n",
      "Added 4 column values with 0 errors.\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>input</th>\n",
       "      <th>longest_word</th>\n",
       "      <th>longest_word_2</th>\n",
       "      <th>longest_word_3</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>Hello, world!</td>\n",
       "      <td>Hello,</td>\n",
       "      <td>Hello</td>\n",
       "      <td>Hello</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>You can do a lot with Pixeltable UDFs.</td>\n",
       "      <td>Pixeltable</td>\n",
       "      <td>Pixeltable</td>\n",
       "      <td>Pixeltable</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>Pixeltable updates tables incrementally.</td>\n",
       "      <td>incrementally.</td>\n",
       "      <td>incrementally</td>\n",
       "      <td>incrementally</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>Let&#x27;s check that it still works.</td>\n",
       "      <td>works.</td>\n",
       "      <td>check</td>\n",
       "      <td>Let&#x27;s</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>"
      ],
      "text/plain": [
       "                                      input    longest_word longest_word_2  \\\n",
       "0                             Hello, world!          Hello,          Hello   \n",
       "1    You can do a lot with Pixeltable UDFs.      Pixeltable     Pixeltable   \n",
       "2  Pixeltable updates tables incrementally.  incrementally.  incrementally   \n",
       "3          Let's check that it still works.          works.          check   \n",
       "\n",
       "  longest_word_3  \n",
       "0          Hello  \n",
       "1     Pixeltable  \n",
       "2  incrementally  \n",
       "3          Let's  "
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "t['longest_word_3'] = longest_word(t.input, strip_punctuation=True)\n",
    "t.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3f12d7c2-b5dd-446a-997d-5510131a6ade",
   "metadata": {},
   "source": [
    "The general rule is: changes to module UDFs will affect any future execution; changes to local UDFs will only affect _new columns_ that are defined using the new version of the UDF."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "42158866-8127-405c-af25-cd59d0acef8c",
   "metadata": {},
   "source": [
    "## Batching\n",
    "\n",
    "Pixeltable provides several ways to optimize UDFs for better performance. One of the most common is _batching_, which is particularly important for UDFs that involve GPU operations.\n",
    "\n",
    "Ordinary UDFs process one row at a time, meaning the UDF will be invoked exactly once per row processed. Conversely, a batched UDF processes several rows at a time; the specific number is user-configurable. As an example, let's modify our `longest_word` UDF to take a batched parameter. Here's what it looks like:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "7bb26ff0-155d-4c51-a19f-92d235195d61",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pixeltable.func import Batch\n",
    "\n",
    "@pxt.udf(batch_size=16)\n",
    "def longest_word(sentences: Batch[str], strip_punctuation: bool = False) -> Batch[str]:\n",
    "    results = []\n",
    "    for sentence in sentences:\n",
    "        words = sentence.split()\n",
    "        if strip_punctuation:\n",
    "            words = [\n",
    "                word if word[-1].isalnum() else word[:-1]\n",
    "                for word in words\n",
    "            ]\n",
    "        i = np.argmax([len(word) for word in words])\n",
    "        results.append(words[i])\n",
    "    return results"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bd41d42a-2df0-4b14-b000-ab1cbf4b22a8",
   "metadata": {},
   "source": [
    "There are several changes:\n",
    "- The parameter `batch_size=16` has been added to the `@pxt.udf` decorator, specifying the batch size;\n",
    "- The `sentences` parameter has changed from `str` to `Batch[str]`;\n",
    "- The return type has also changed from `str` to `Batch[str]`; and\n",
    "- Instead of processing a single sentence, the UDF is processing a `Batch` of sentences and returning the result `Batch`.\n",
    "\n",
    "What exactly is a `Batch[str]`? Functionally, it's simply a `list[str]`, and you can use it exactly like a `list[str]` in any Python code. The only difference is in the type hint; a type hint of `Batch[str]` tells Pixeltable, \"My data consists of individual strings that I want you to process in batches\". Conversely, a type hint of `list[str]` would mean, \"My data consists of _lists_ of strings that I want you to process one at a time\".\n",
    "\n",
    "Notice that the `strip_punctuation` parameter is _not_ wrapped in a `Batch` type. This because `strip_punctuation` controls the behavior of the UDF, rather than being part of the input data. When we use the batched `longest_word` UDF, the `strip_punctuation` parameter will always be a constant, not a column.\n",
    "\n",
    "Let's put the new, batched UDF to work."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "3a1b54bb-044d-4235-b9ee-e9ad2adf572d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Computing cells: 100%|███████████████████████████████████████████| 4/4 [00:00<00:00, 497.26 cells/s]\n",
      "Computing cells: 100%|███████████████████████████████████████████| 4/4 [00:00<00:00, 736.52 cells/s]\n",
      "Added 4 column values with 0 errors.\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>input</th>\n",
       "      <th>longest_word</th>\n",
       "      <th>longest_word_2</th>\n",
       "      <th>longest_word_3</th>\n",
       "      <th>longest_word_3_batched</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>Hello, world!</td>\n",
       "      <td>Hello,</td>\n",
       "      <td>Hello</td>\n",
       "      <td>Hello</td>\n",
       "      <td>Hello</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>You can do a lot with Pixeltable UDFs.</td>\n",
       "      <td>Pixeltable</td>\n",
       "      <td>Pixeltable</td>\n",
       "      <td>Pixeltable</td>\n",
       "      <td>Pixeltable</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>Pixeltable updates tables incrementally.</td>\n",
       "      <td>incrementally.</td>\n",
       "      <td>incrementally</td>\n",
       "      <td>incrementally</td>\n",
       "      <td>incrementally</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>Let&#x27;s check that it still works.</td>\n",
       "      <td>works.</td>\n",
       "      <td>check</td>\n",
       "      <td>Let&#x27;s</td>\n",
       "      <td>Let&#x27;s</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>"
      ],
      "text/plain": [
       "                                      input    longest_word longest_word_2  \\\n",
       "0                             Hello, world!          Hello,          Hello   \n",
       "1    You can do a lot with Pixeltable UDFs.      Pixeltable     Pixeltable   \n",
       "2  Pixeltable updates tables incrementally.  incrementally.  incrementally   \n",
       "3          Let's check that it still works.          works.          check   \n",
       "\n",
       "  longest_word_3 longest_word_3_batched  \n",
       "0          Hello                  Hello  \n",
       "1     Pixeltable             Pixeltable  \n",
       "2  incrementally          incrementally  \n",
       "3          Let's                  Let's  "
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "t['longest_word_3_batched'] = longest_word(t.input, strip_punctuation=True)\n",
    "t.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c487704a-7e88-4d53-bf9e-8838e8a6a6f0",
   "metadata": {},
   "source": [
    "As expected, the output of the `longest_word_3_batched` column is identical to the `longest_word_3` column. Under the covers, though, Pixeltable is orchestrating execution in batches of 16. That probably won't have much performance impact on our toy example, but for GPU-bound computations such as text or image embeddings, it can make a substantial difference."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.19"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
